Colorado Local Data

Introduction

Don Merrion of the Division of Local Government in the Colorado Department of Local Affairs was kind of enough to compile and share data capturing information about county and local finances over the 1975-2009 period. He actually gave this to me months ago, but World Bank work (pretty tough to be dismayed by that) and frequent travel have hindered my exploration of it. At long last, I am going to take a look at what I have here. The data reside on home/choct155/Google Drive/Diseertation/Data/.

In [59]:
pwd
Out[59]:
u'/home/choct155/Google Drive/Dissertation/Data'
In [60]:
ls
02fingid.zip                          fin04gid.zip
97ItemCodes0500.txt                   fin05gid.zip
allgmp.csv                            gmpcompare.sas
BEA/                                  gmp.sas7bdat
bearfacts(1).zip                      govfinagg.sas
Census/                               indagg.sas7bdat
City_Govt_Finances.mdb                MetroGDP.csv
City Level Finances (1951-2006).xlsx  MUNY_ENT_1975_to_2009.csv
CO_muni_county.ipynb                  MUNY_GG_1975_to_2009.csv
ConsumerExpFields.txt                 PCGMP.csv
COUNTY_ENT_1975_TO_2009.csv           State by Level of Gov/
COUNTY_GG_1975_TO_2009.csv            statefips.csv
Data/                                 statefips.sas7bdat
dpga.csv                              totgmp (2).sas7bdat
econvariance.R                        totgmp.sas7bdat
fin03gid.zip

In [358]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pandas.io.data as web
from IPython.display import HTML


#Set print widith
pd.set_option('line_width',140)

#Set plot style
pd.options.display.mpl_style='default'

#Establish working directory
workdir='/home/choct155/Google Drive/Dissertation/Data/'

#Read in data
county_ent=pd.read_csv(workdir+'COUNTY_ENT_1975_TO_2009.csv')
county_gg=pd.read_csv(workdir+'COUNTY_GG_1975_TO_2009.csv')
muni_ent=pd.read_csv(workdir+'MUNY_ENT_1975_to_2009.csv')
muni_gg=pd.read_csv(workdir+'MUNY_GG_1975_to_2009.csv')

print county_ent
print county_gg
print muni_ent
print muni_gg
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2573 entries, 0 to 2572
Data columns (total 25 columns):
LG_ID                 2573  non-null values
NAME                  2573  non-null values
LGTYPE_ID             2573  non-null values
AUDIT_YEAR            2573  non-null values
CM_FUND_TYPE_ID       2573  non-null values
DESCRIPTION           2573  non-null values
REV_TOTAL             2573  non-null values
REV_CHARGES           2573  non-null values
REV_IMPACT_FEE        2572  non-null values
REV_INTGOVT           2572  non-null values
REV_TRANSFER_IN       2573  non-null values
REV_INTEREST          2573  non-null values
REV_OTHER             2573  non-null values
EXP_TOTAL             2573  non-null values
EXP_OPERATING         2573  non-null values
EXP_CAPITAL_OUTLAY    2573  non-null values
EXP_PRINCIPAL         2573  non-null values
EXP_INTEREST          2573  non-null values
EXP_TRANSFER_OUT      2573  non-null values
GO_DEBT               2572  non-null values
REVENUE_DEBT          2572  non-null values
OTHER_DEBT            2572  non-null values
ASSETS                2572  non-null values
LIABILITIES           2572  non-null values
DEPRECIATION          2570  non-null values
dtypes: float64(8), int64(15), object(2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2240 entries, 0 to 2239
Data columns (total 58 columns):
LG_ID                           2240  non-null values
NAME                            2240  non-null values
LGTYPE_ID                       2240  non-null values
AUDIT_YEAR                      2240  non-null values
REV_TOTAL                       2240  non-null values
REV_TOTAL_TAX                   2240  non-null values
REV_PROPERTY_TAX                2240  non-null values
REV_SO_TAX                      2240  non-null values
REV_SALES_USE_TAX               2240  non-null values
REV_OCCUPATION_TAX              2240  non-null values
REV_FRANCHISE_TAX               2240  non-null values
REV_OTHER_TAX                   2240  non-null values
REV_LODGING_TAX                 2240  non-null values
REV_REAL_ESTATE_TRANSFER_TAX    2240  non-null values
REV_UNCLASS_TAX                 2240  non-null values
REV_LICENSES                    2240  non-null values
REV_CHARGES                     2240  non-null values
REV_FINES                       2240  non-null values
REV_TRANSFER_IN                 2240  non-null values
REV_INTGOVT                     2240  non-null values
REV_HUT                         2240  non-null values
REV_CIGARETTE_TAX               2240  non-null values
REV_MOTOR_VEH_FEE               2240  non-null values
REV_CTF                         2240  non-null values
REV_SOCIAL_SERVICE              2240  non-null values
ALL_OTHER_INTGOVT               2240  non-null values
REV_MISC                        2240  non-null values
REV_INTEREST                    2240  non-null values
EXP_TOTAL                       2240  non-null values
EXP_TOTAL_OPERATING             2240  non-null values
EXP_GEN_GOVT                    2240  non-null values
EXP_JUDICIAL                    2240  non-null values
EXP_TOTAL_PUBLIC_SAFETY         2240  non-null values
EXP_POLICE                      2240  non-null values
EXP_FIRE                        2240  non-null values
EXP_OTHER_PUBLIC_SAFETY         2240  non-null values
EXP_TOTAL_PUBLIC_WORKS          2240  non-null values
EXP_STREET                      2240  non-null values
EXP_TRASH                       2240  non-null values
EXP_OTHER_PUBLIC_WORKS          2240  non-null values
EXP_HEALTH                      2240  non-null values
EXP_RECREATION                  2240  non-null values
EXP_SOCIAL_SERVICE              2240  non-null values
EXP_MISC                        2240  non-null values
EXP_TRANSFER_OUT                2240  non-null values
EXP_CAPITAL_OUTLAY              2240  non-null values
EXP_DEBT_SERVICE_GEN            2240  non-null values
EXP_PRINCIPAL_GEN               2240  non-null values
EXP_INTEREST_GEN                2240  non-null values
GO_DEBT_GEN                     2240  non-null values
REVENUE_DEBT_GEN                2240  non-null values
OTHER_DEBT_GEN                  2240  non-null values
ASSETS                          2240  non-null values
LIABILITIES                     2240  non-null values
POPULATION                      2240  non-null values
RETAIL_SALES                    2240  non-null values
ST_SALES_TAX_PAID               2240  non-null values
SALES_TAX_RATE                  2240  non-null values
dtypes: float64(1), int64(56), object(1)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16100 entries, 0 to 16099
Data columns (total 24 columns):
LG_ID                 16100  non-null values
NAME                  16100  non-null values
AUDIT_YEAR            16100  non-null values
CM_FUND_TYPE_ID       16100  non-null values
DESCRIPTION           16100  non-null values
REV_TOTAL             16100  non-null values
REV_CHARGES           16100  non-null values
REV_IMPACT_FEE        16100  non-null values
REV_INTGOVT           16100  non-null values
REV_TRANSFER_IN       16100  non-null values
REV_INTEREST          16098  non-null values
REV_OTHER             16098  non-null values
EXP_TOTAL             16100  non-null values
EXP_OPERATING         16100  non-null values
EXP_CAPITAL_OUTLAY    16099  non-null values
EXP_PRINCIPAL         16098  non-null values
EXP_INTEREST          16097  non-null values
EXP_TRANSFER_OUT      16096  non-null values
GO_DEBT               16096  non-null values
REVENUE_DEBT          16096  non-null values
OTHER_DEBT            16096  non-null values
ASSETS                16098  non-null values
LIABILITIES           16097  non-null values
DEPRECIATION          16093  non-null values
dtypes: float64(12), int64(10), object(2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9283 entries, 0 to 9282
Data columns (total 56 columns):
LG_ID                      9283  non-null values
NAME                       9283  non-null values
AUDIT_YEAR                 9283  non-null values
REV_TOTAL                  9283  non-null values
REV_TOTAL_TAX              9283  non-null values
REV_PROPERTY_TAX           9283  non-null values
REV_SO_TAX                 9283  non-null values
REV_SALES_USE_TAX          9283  non-null values
REV_OCCUPATION_TAX         9282  non-null values
REV_FRANCHISE_TAX          9283  non-null values
REV_OTHER_TAX              9283  non-null values
REV_UNCLASS_TAX            9281  non-null values
REV_LICENSES               9283  non-null values
REV_CHARGES                9282  non-null values
REV_FINES                  9283  non-null values
REV_TRANSFER_IN            9283  non-null values
REV_INTGOVT                9283  non-null values
REV_HUT                    9283  non-null values
REV_CIGARETTE_TAX          9283  non-null values
REV_MOTOR_VEH_FEE          9283  non-null values
REV_CTF                    9283  non-null values
ALL_OTHER_INTGOVT          9283  non-null values
REV_MISC                   9283  non-null values
REV_INTEREST               9283  non-null values
EXP_TOTAL                  9283  non-null values
EXP_TOTAL_OPERATING        9283  non-null values
EXP_GEN_GOVT               9283  non-null values
EXP_JUDICIAL               9283  non-null values
EXP_TOTAL_PUBLIC_SAFETY    9283  non-null values
EXP_POLICE                 9283  non-null values
EXP_FIRE                   9282  non-null values
EXP_OTHER_PUBLIC_SAFETY    9283  non-null values
EXP_TOTAL_PUBLIC_WORKS     9283  non-null values
EXP_STREET                 9283  non-null values
EXP_TRASH                  9283  non-null values
EXP_OTHER_PUBLIC_WORKS     9283  non-null values
EXP_HEALTH                 9282  non-null values
EXP_RECREATION             9283  non-null values
EXP_MISC                   9282  non-null values
EXP_TRANSFER_OUT           9283  non-null values
EXP_CAPITAL_OUTLAY         9283  non-null values
EXP_DEBT_SERVICE_GEN       9283  non-null values
EXP_PRINCIPAL_GEN          9283  non-null values
EXP_INTEREST_GEN           9283  non-null values
GO_DEBT_GEN                9282  non-null values
REVENUE_DEBT_GEN           9283  non-null values
OTHER_DEBT_GEN             9283  non-null values
ASSETS                     9283  non-null values
LIABILITIES                9283  non-null values
POPULATION                 9283  non-null values
RURAL_VEH_REG              9283  non-null values
MUN_VEH_REG                9283  non-null values
RETAIL_SALES               9283  non-null values
ST_SALES_TAX_PAID          9283  non-null values
SALES_TAX_RATE             9282  non-null values
ROADS                      9283  non-null values
dtypes: float64(9), int64(46), object(1)

Validation

It looks like general fund data are in the 'gg' files, so let's start there. There are fewer county records, so that can be the guinea pig. The DF is too wide to display, so we will display the head() in chunks.

In [62]:
for i in range(5):
    print county_gg.head(5)[county_gg.columns[i*10:(i+1)*10]].to_string()
print county_gg.head(5)[county_gg.columns[50:]].to_string()
   LG_ID          NAME  LGTYPE_ID  AUDIT_YEAR  REV_TOTAL  REV_TOTAL_TAX  REV_PROPERTY_TAX  REV_SO_TAX  REV_SALES_USE_TAX  REV_OCCUPATION_TAX
0   1001  Adams County          1        1975   27622900        9229100           8248100      898700                  0                   0
1   1001  Adams County          1        1976   32777100       12619700          11283900     1256200                  0                   0
2   1001  Adams County          1        1977   35305800       13598500          12402300     1164600                  0                   0
3   1001  Adams County          1        1978   43821800       15485500          13881200     1558300                  0                   0
4   1001  Adams County          1        1979   44203200       16403900          14584300     1780800                  0                   0
   REV_FRANCHISE_TAX  REV_OTHER_TAX  REV_LODGING_TAX  REV_REAL_ESTATE_TRANSFER_TAX  REV_UNCLASS_TAX  REV_LICENSES  REV_CHARGES  REV_FINES  REV_TRANSFER_IN  REV_INTGOVT
0                  0          82300                0                             0                0        163900      1361100          0                0     16008800
1                  0          79600                0                             0                0        200000      1557300          0                0     17276500
2                  0          31600                0                             0                0        206600      1977000          0                0     18554200
3                  0          46000                0                             0                0        204900      2240600          0                0     24679000
4                  0          38800                0                             0                0        258000      2317100          0                0     22927600
   REV_HUT  REV_CIGARETTE_TAX  REV_MOTOR_VEH_FEE  REV_CTF  REV_SOCIAL_SERVICE  ALL_OTHER_INTGOVT  REV_MISC  REV_INTEREST  EXP_TOTAL  EXP_TOTAL_OPERATING
0  1072000             317900             234400        0            10972400            3412100    860000        463876   27963500             25221500
1  1046000             343300             114500        0            11315500            4457200   1123600        668397   31194700             27476200
2  1162000             363700             255700        0            11946200            4826600    969500        656143   35153600             31761200
3  1299900             339600             327400        0            12843000            9869100   1211800        961809   41602200             39802200
4  1451300             394100             220100        0            13042200            7819900   2296600       1746913   41577000             39673900
   EXP_GEN_GOVT  EXP_JUDICIAL  EXP_TOTAL_PUBLIC_SAFETY  EXP_POLICE  EXP_FIRE  EXP_OTHER_PUBLIC_SAFETY  EXP_TOTAL_PUBLIC_WORKS  EXP_STREET  EXP_TRASH  EXP_OTHER_PUBLIC_WORKS
0       4666500        743800                  2408700     1927100         0                   481600                 1617000     1608300       3400                    5300
1       5111500        901800                  2807500     2220200         0                   587300                 2313400     2305800          0                    7600
2       5038400       1036300                  3031700     2715000         0                   316700                 2419000     2410400          0                    8600
3       5364500       1321700                  3264100     2929800         0                   334300                 3351800     3342700          0                    9100
4       5897700       1544900                  3585500     3254400         0                   331100                 3650300     3641600          0                    8700
   EXP_HEALTH  EXP_RECREATION  EXP_SOCIAL_SERVICE  EXP_MISC  EXP_TRANSFER_OUT  EXP_CAPITAL_OUTLAY  EXP_DEBT_SERVICE_GEN  EXP_PRINCIPAL_GEN  EXP_INTEREST_GEN  GO_DEBT_GEN
0      964600         1474900            13290500     55500            546600             2195400                     0                  0                 0            0
1     1078300         1509400            13738200     16100            458900             3259600                     0                  0                 0            0
2     1357100         1947200            14448600   2482900            622000             2770400                     0                  0                 0            0
3     1430200         1802200            15449800   7817900            652200             1147800                     0                  0                 0            0
4     1549300         1894200            16255500   5296500            756500             1146600                     0                  0                 0            0
   REVENUE_DEBT_GEN  OTHER_DEBT_GEN  ASSETS  LIABILITIES  POPULATION  RETAIL_SALES  ST_SALES_TAX_PAID  SALES_TAX_RATE
0                 0               0       0            0      216880       1143432           20346562               0
1                 0               0       0            0      222739       1288492           24022048               0
2                 0               0       0            0      226135       1514871           28278304               0
3                 0               0       0            0      239634       1844429           35943168               0
4                 0               0       0            0      246063       2198059           43148344               0

Ok, I am assuming that LG_ID is a unique identifier for each jurisdiction. Lucky for me, I have a handy function to make this easy...

In [63]:
def one2one(seq1,seq2):
    seq_dict=dict(zip(seq1,seq2))
    values=[]
    for key in seq_dict:
        values.append(seq_dict[key])
    print len(seq_dict.keys()),'|',len(set(values))
    return len(seq_dict.keys())==len(set(values))

one2one(county_gg['LG_ID'],county_gg['NAME'])
64 | 64

Out[63]:
True

So, they are unique identifiers, which means either can be used in the index. Since LG_ID doesn't provide additional information, no use carrying dead weight.

In [332]:
#Drop county ID
county_gg.pop('LG_ID')
Out[332]:
0     1001
1     1001
2     1001
3     1001
4     1001
5     1001
6     1001
7     1001
8     1001
9     1001
10    1001
11    1001
12    1001
13    1001
14    1001
...
2225    64030
2226    64030
2227    64030
2228    64030
2229    64030
2230    64030
2231    64030
2232    64030
2233    64030
2234    64030
2235    64030
2236    64030
2237    64030
2238    64030
2239    64030
Name: LG_ID, Length: 2240, dtype: int64

I am also going to throw year and jurisdictional type into the index, but first I need to look at how many jurisdictional types we have. It's probably also worth checking the completeness of years.

In [164]:
#Count jurisdictions
print county_gg['LGTYPE_ID'].value_counts()

#Do all years have the same number of obs?
plt.rcParams['figure.figsize']=10,3
print county_gg['AUDIT_YEAR'].value_counts().sort_index().plot(kind='bar',title='Completeness by year')

#Set index
c_gg=county_gg.set_index(['NAME','AUDIT_YEAR','LGTYPE_ID'])
1     2170
61      35
70      35
dtype: int64
Axes(0.125,0.125;0.775x0.775)

So we have full coverage on the years, and we have discovered two additional jurisdictional types. I assume type 1 is a general purpose jurisdiction, while the others (types 61 and 70) represent special purpose districts of some sort. If we can't tell from the names, we will have to check this out a bit later.

To check these items, we will examine a cross-section by LGTYPE_ID and pull the NAME attribute from the index. We really only care about unique names, so set() again comes into play.

In [165]:
print set(c_gg.xs(61,level='LGTYPE_ID').index.get_level_values(level='NAME'))
print set(c_gg.xs(70,level='LGTYPE_ID').index.get_level_values(level='NAME'))
set(['Denver, City And County of'])
set(['Broomfield, City and County of'])

Both 61 and 70 identify singular entities. The odd thing is that they are both City/County combinations, so I am unclear on why they have different LGTYPE_ID.

In any event, just to provide an easy reference, here are the rest of the jurisdictions.

In [166]:
print set(c_gg.xs(1,level='LGTYPE_ID').index.get_level_values(level='NAME'))
set(['Gilpin County', 'Las Animas County', 'Morgan County', 'Alamosa County', 'Dolores County', 'Summit County', 'Jefferson County', 'Pueblo County', 'Clear Creek County', 'Bent County', 'Teller County', 'Rio Grande County', 'Montrose County', 'Adams County', 'Moffat County', 'San Miguel County', 'Larimer County', 'Mesa County', 'Archuleta County', 'Sedgwick County', 'Arapahoe County', 'Hinsdale County', 'Lincoln County', 'Jackson County', 'Mineral County', 'Weld County', 'Huerfano County', 'Montezuma County', 'Routt County', 'Garfield County', 'Custer County', 'Saguache County', 'Ouray County', 'La Plata County', 'Otero County', 'Boulder County', 'Baca County', 'Conejos County', 'Lake County', 'Washington County', 'Eagle County', 'Douglas County', 'Logan County', 'Costilla County', 'Kiowa County', 'Yuma County', 'Chaffee County', 'Delta County', 'Phillips County', 'El Paso County', 'Fremont County', 'Rio Blanco County', 'San Juan County', 'Gunnison County', 'Elbert County', 'Cheyenne County', 'Park County', 'Kit Carson County', 'Prowers County', 'Pitkin County', 'Crowley County', 'Grand County'])

Cleaning/Munging

We have too many columns to deal with at the current time, and it appears that they are hierarchical. Fortunately, the Division of Local Government in CO has provided data summaries by year that reveal the nesting behavior. Consequently, we can reduce these to some extent.

In [255]:
HTML('<iframe src=http://www.colorado.gov/cs/Satellite?blobcol=urldata&blobheadername1=Content-Disposition&blobheadername2='+
'Content-Type&blobheadervalue1=inline%3B+filename%3D%222009+County+Compendium+Information.pdf%22&blobheadervalue2=application%2Fpdf&blobkey='+
'id&blobtable=MungoBlobs&blobwhere=1251848971158&ssbinary=true width=1000 height=950></iframe>')
Out[255]:
In [167]:
#for coll in c_gg.columns:
#    print coll

#Construct hierarchical dictionaries    
c_rev_dict={'REV_TOTAL':['REV_TOTAL_TAX','REV_LICENSES','REV_INTGOVT','REV_CHARGES','REV_FINES','REV_MISC','REV_TRANSFER_IN'],
          'REV_TOTAL_TAX':['REV_PROPERTY_TAX','REV_SO_TAX','REV_SALES_USE_TAX','REV_FRANCHISE_TAX','REV_OCCUPATION_TAX','REV_OTHER_TAX','REV_UNCLASS_TAX'],
          'REV_INTGOVT':['REV_HUT','REV_CIGARETTE_TAX','REV_MOTOR_VEH_FEE','REV_CTF','REV_SOCIAL_SERVICE','ALL_OTHER_INTGOVT']}
c_exp_dict={'EXP_TOTAL':['EXP_TOTAL_OPERATING','EXP_TRANSFER_OUT','EXP_CAPITAL_OUTLAY','EXP_DEBT_SERVICE_GEN'],
          'EXP_TOTAL_OPERATING':['EXP_GEN_GOVT','EXP_JUDICIAL','EXP_TOTAL_PUBLIC_SAFETY','EXP_TOTAL_PUBLIC_WORKS','EXP_HEALTH','EXP_RECREATION','EXP_SOCIAL_SERVICE','EXP_MISC'],
          'EXP_TOTAL_PUBLIC_SAFETY':['EXP_POLICE','EXP_FIRE','EXP_OTHER_PUBLIC_SAFETY'],
          'EXP_TOTAL_PUBLIC_WORKS':['EXP_STREET','EXP_TRASH','EXP_OTHER_PUBLIC_WORKS'],
          'EXP_DEBT_SERVICE_GEN':['EXP_PRINCIPAL_GEN','EXP_INTEREST_GEN']}
c_debt_dict={'TOTAL_DEBT':['GO_DEBT_GEN','REVENUE_DEBT_GEN','OTHER_DEBT_GEN']}

There are a few things to note here: >1. TOTAL_DEBT does not exist in the current set, so we will create it by summing the other outstanding debt variables (those found in the values array of debt_dict);

  1. The data include information about sales tax rate, retail sales activity, and state sales tax collected. We will leave this alone for the time being; and,
  1. The data include population information, which will be retained to provide per capita perspectives.

To test the validity of our nesting scheme, the first thing we should do is subset to a single jurisdiction and see if our values add up. Adams County should work as well as any other.

The validation check loops through each key of each dictionary created. It then identifies the 'total' column using the key, and the 'component' columns using the values associated with said key. A boolean check for equality of the total and sum of the components is deposited into a Series object (called 'same' below). This happens because the high-level boolean check is actually a vectorized operation. The check is thus performed for each year in the 'total' and 'component' Series (which I think need to be of the same length). If the check encounters an inequality, it prints False, the index position, the corresponding component values, and the difference between the given 'total' value and the value calculated as the sum of the 'components'. It is important to capture both the component values and the difference because if the difference corresponds to one of the component values, we know which one shouldn't be there.

In [168]:
#Create TOTAL_DEBT
c_gg['TOTAL_DEBT']=c_gg[['GO_DEBT_GEN','REVENUE_DEBT_GEN','OTHER_DEBT_GEN']].sum(axis=1)

#Generate Adams County subset
adams=c_gg.ix['Adams County']

for dct in [c_rev_dict,c_exp_dict,c_debt_dict]:
    for key in dct:
        same=adams[key]==adams[dct[key]].sum(axis=1)
        for i in range(len(same)):
            if same[i]==False:
                print 'KEY (',key,'): THE FOLLOWING COMPONENTS ARE NOT FULLY SPECIFIED\n'
                print same[i], same.index[i]
                print adams[dct[key]][i:i+1],'\n'
                print 'DIFFERENCE >>> ',adams[key][i:i+1]-adams[dct[key]].sum(axis=1)[i:i+1]
            else:
                pass

This ended up being an important check because I did add in something that shouldn't be there. For TOTAL_EXP, I was unsure of what to do with EXP_DEBT_SERVICE_GEN. It appears in the input data, but it does not appear in the data summaries mentioned above (on the website). I figured that it should just be grouped with principal (EXP_PRINCIPAL_GEN) and interest (EXP_INTEREST_GEN) payments, but this is not actually the case. As it turns out, EXP_DEBT_SERVICE_GEN was actually a summary wrapper for EXP_PRINCIPAL_GEN and EXP_INTEREST_GEN, and thus warranted its own key with these items as components.

Deflation

In any case, we are all good now and explore the data a bit. We can take a look at the very wide angle view capturing total revenues over time, but first, we should deflate the values. We can pull CPI information from the FRED database.

In [169]:
#Open feed to CPI info
cpi=web.get_data_fred('USACPIBLS','1/1/1975','1/1/2009')

#Create copy in case offline work is occurring
cpi.to_csv(workdir+'us_cpi_1975_2009.csv')

#Create common index for later join with data
cpi['AUDIT_YEAR']=range(1975,2010)
cpi2=cpi.reset_index().set_index('AUDIT_YEAR')

#Rename columns
cpi2.columns=['DATE','CPI']

#Drop DATE
cpi2.pop('DATE')

#Calculate deflator ratios
dfl=cpi2.div(cpi2.ix[2009])

dfl.head()
Out[169]:
CPI
AUDIT_YEAR
1975 0.250816
1976 0.265268
1977 0.282517
1978 0.303963
1979 0.338462

Now we can join with the county set for deflation. Notice the technique here, it's basically an implicit join. In fact, this method renders irrelevant the name matching I did above. This is a much faster way of matching on a MultiIndex than I have attempted in the past. Longer head() snapshots reveal a correct merge.

In [175]:
#Implicit merge
c_gg['DFL']=dfl.ix[c_gg.index.get_level_values(level='AUDIT_YEAR')].values
print c_gg[['REV_TOTAL','DFL']].head(5)
                                   REV_TOTAL       DFL
NAME         AUDIT_YEAR LGTYPE_ID                     
Adams County 1975       1           27622900  0.250816
             1976       1           32777100  0.265268
             1977       1           35305800  0.282517
             1978       1           43821800  0.303963
             1979       1           44203200  0.338462

Now we need to broadcast the DFL across columns.

In [195]:
#Divide each column by deflator ratio
cgg_real=c_gg.div(c_gg['DFL'],axis=0)

Exploration of County Fiscal Data (Non-Enterprise)

First a few summary plots looking at the trend in revenue from all counties over time.

In [202]:
#Group by year
cgg_yr=cgg_real.groupby(level='AUDIT_YEAR')

#Plot total, avg, and median revenues over time
plt.rcParams['figure.figsize']=20,10
fig,axes=plt.subplots(1,2)
cgg_yr['REV_TOTAL'].sum().plot(kind='line',ax=axes[0],linewidth=3,title='Total Revenues from All Counties in Colorado')
cgg_yr['REV_TOTAL'].mean().plot(kind='line',ax=axes[1],linewidth=3,label='AVG')
cgg_yr['REV_TOTAL'].median().plot(kind='line',ax=axes[1],linestyle='--',linewidth=3,title='Average and Median Revenues from All Counties in Colorado',label='MED')

#Generate a box plot of REV_TOTAL
cgg_real.reset_index('AUDIT_YEAR').boxplot(column='REV_TOTAL',by='AUDIT_YEAR')
Out[202]:
<matplotlib.axes.AxesSubplot at 0x16024910>

Note that while revenues are growing over time in real terms (even for the median jurisdiction), the spread between average and median revenues is growing. This is reaffirmed by the expanding inter-quartile spread in the boxplot, and can have implications for exacerbating the impact of tax and expenditure limitations. In other words, it looks like it will be easier to tease out categorical differences between jurisdictions over time.

It would be interesting to know how the shape of the distribution has changed over time.

In [203]:
#Create cross-sections starting in 1979 and preceding at decade clips to 2009 and plot them
plt.rcParams['figure.figsize']=20,10
for i in range(4):
    c_gg_cross=cgg_real.xs((1979+i*10),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',label=str((1979+i*10)))
    plt.xlim([-500000000,1500000000])
    plt.legend(loc='best')
    

In real terms, the distribution of revenues is flattening over time. The peaks are shifting rightward, and the non-overlapping area relative to previous years is larger on the positive side. This suggests that even if the absolute spread between median and average jurisdictions is spreading, the relative spread may be going in the other direction.

A more continuous display may be useful...

In [287]:
#Create cross-sections starting in 1979 and preceding at decade clips to 2009 and plot them
plt.rcParams['figure.figsize']=20,10
for i in range(31):
    c_gg_cross=cgg_real.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',color=[(i/float(31)),((i+1)/float(32)),((i+2)/float(33))],label=str((1979+i)))
    plt.xlim([-500000000,1500000000])
    plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)

plt.figure()
for i in range(31):
    c_gg_cross=cgg_real.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',color=(((31-i)/float(31)),0.,(i/float(31))),label=str((1979+i)))
    plt.xlim([-500000000,1500000000])
    plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)

To get a continuous color distribution, we had to utilize RGB sequences. Each sequence is a tuple of three values on the interval [0-1]. To get subtle changes across each year, we employed the iterator to make small changes in one or more of the three sequence positions. More information on the matplotlib (the default graphing base for pandas) treatment of colors can be found here.

It would be interesting to see if expenditures follow a similar pattern. The next plot uses only three time periods and overlays expenditure distribution with dotted lines.

In [211]:
plt.rcParams['figure.figsize']=20,10
colors=['r','b','g']
for i in range(3):
    c_gg_cross=cgg_real.xs((1979+i*15),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',color=colors[i],label=str(1979+i*15)+'REV')
    c_gg_cross['EXP_TOTAL'].plot(kind='kde',linewidth=3,linestyle='--',title='Evolution of County Revenue/Expenditure Shape Over Time',color=colors[i],label=str(1979+i*15)+'EXP')
    plt.xlim([-500000000,1500000000])
    plt.legend(loc='best')

There is a noticeable consistency in pattern, but the disparity between revenues and expenditures in 2009 is dramatic. This is likely a consequence of the bursting housing bubble. Sure would be nice to have another four years of data. Perhaps that can still be acquired.

Ok, let's get into the compositional aspects. For the time being, we are still using the set aggregated by year and we will look at revenues and then expenditures. For each we will consider compositional shifts, absolute growth, and indexed growth. (Debt dynamics will be assessed when actively required.)

In [242]:
plt.rcParams['figure.figsize']=25,20
fig,axes=plt.subplots(3)
i=0
for key in c_rev_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_rev_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=[cols])
    sub_df[c_rev_dict[key]].div(sub_df[key],axis=0).plot(kind='bar',stacked=True,ax=axes[i],title='Composition of '+key+' Revenue over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_rev_dict[key])+1)
    i+=1

Absolute Growth...

In [243]:
fig,axes=plt.subplots(3)
i=0
for key in c_rev_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_rev_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=[cols])
    sub_df[c_rev_dict[key]].div(sub_df[key],axis=0).plot(kind='line',ax=axes[i],linewidth=4,title='Proportional Growth of '+key+' Component Revenue over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_rev_dict[key])+1)
    i+=1

Index Growth...

In [244]:
fig,axes=plt.subplots(3)
i=0
for key in c_rev_dict.keys():
    #Subset to columns of interest
    cols=c_rev_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=[cols])
    DataFrame((sub_df/sub_df.ix[1975])*100).plot(kind='line',ax=axes[i],linewidth=5,title='Indexed Absolute Growth of '+key+' Revenue over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_rev_dict[key])+1)
    i+=1

While property tax is clearly the dominant factor in county finances, there are some other interesting things going. For one, the fastest growing revenue source of the period has been transfers from enterprises. It remains a relatively small factor, but it's growth is undeniable. Couple this with marked increase in franchise taxes relative to other tax collections, and it suggests a greater level of business activity.

Secondly, sales and use tax made an aggressive push to approach property tax for dominance in the late 1990s, but there was a sharp deflection in 2000. This compositional action was driven almost entirely by sales and use growth rather than a property decline.

It is also interesting to note that most of the intergovernmental sources of revenue experienced flat or declining growth over the period. From a compositional standpoint, total intergovernmental importance declined in importance of the period, and it appears tax collections picked up the slack.

Replicating these views for expenditures...

In [245]:
plt.rcParams['figure.figsize']=25,25
fig,axes=plt.subplots(5)
i=0
for key in c_exp_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_exp_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=[cols])
    sub_df[c_exp_dict[key]].div(sub_df[key],axis=0).plot(kind='bar',stacked=True,ax=axes[i],title='Composition of '+key+' Spending over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_exp_dict[key])+1)
    i+=1
In [246]:
fig,axes=plt.subplots(5)
i=0
for key in c_exp_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_exp_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=[cols])
    sub_df[c_exp_dict[key]].div(sub_df[key],axis=0).plot(kind='line',ax=axes[i],linewidth=4,title='Proportional Growth of '+key+' Component Spending over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_exp_dict[key])+1)
    i+=1
In [248]:
fig,axes=plt.subplots(5)
i=0
for key in c_exp_dict.keys():
    #Subset to columns of interest
    cols=c_exp_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=[cols])
    DataFrame((sub_df/sub_df.ix[1975])*100).plot(kind='line',ax=axes[i],linewidth=5,title='Indexed Absolute Growth of '+key+' Spending over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_exp_dict[key])+1)
    i+=1

Perhaps the most interesting part of the expenditure side is categorically different behavior from revenues with respect to growth. Expenditures are increasing without regard to revenue behavior. Health spending stands out as a growing portion of the budget, while social services has scaled back. None of this is really surprising. In fact, that's the point. Budget pressures will exist whether or not the levy is constrained.

Ok, I want to know about what's happening below the aggregated surface. Are we seeing individual jurisdictions following the broader trends? To compare with total growth across jurisdictions and with the whole, it will be useful to generate a set of per capita fiscal behavior. In deflating the values for charting, we also deflated the population, which is clearly ridiculous. Consequently, we need to reintegrate the old population figure.

In [297]:
#Integrate old population figure into real value DF
cgg_real['POP']=c_gg.ix[cgg_real.index]['POPULATION'].values

#Restrict columns
cols=['REV_TOTAL', 'REV_TOTAL_TAX', 'REV_PROPERTY_TAX', 'REV_SO_TAX', 'REV_SALES_USE_TAX', 'REV_OCCUPATION_TAX', 'REV_FRANCHISE_TAX', 'REV_OTHER_TAX', \
      'REV_LODGING_TAX', 'REV_REAL_ESTATE_TRANSFER_TAX', 'REV_UNCLASS_TAX', 'REV_LICENSES', 'REV_CHARGES', 'REV_FINES', 'REV_TRANSFER_IN', 'REV_INTGOVT', \
      'REV_HUT', 'REV_CIGARETTE_TAX', 'REV_MOTOR_VEH_FEE', 'REV_CTF', 'REV_SOCIAL_SERVICE', 'ALL_OTHER_INTGOVT', 'REV_MISC', 'REV_INTEREST', 'EXP_TOTAL', \
      'EXP_TOTAL_OPERATING', 'EXP_GEN_GOVT', 'EXP_JUDICIAL', 'EXP_TOTAL_PUBLIC_SAFETY', 'EXP_POLICE', 'EXP_FIRE', 'EXP_OTHER_PUBLIC_SAFETY', 'EXP_TOTAL_PUBLIC_WORKS', \
      'EXP_STREET', 'EXP_TRASH', 'EXP_OTHER_PUBLIC_WORKS', 'EXP_HEALTH', 'EXP_RECREATION', 'EXP_SOCIAL_SERVICE', 'EXP_MISC', 'EXP_TRANSFER_OUT', 'EXP_CAPITAL_OUTLAY', \
      'EXP_DEBT_SERVICE_GEN', 'EXP_PRINCIPAL_GEN', 'EXP_INTEREST_GEN', 'GO_DEBT_GEN', 'REVENUE_DEBT_GEN', 'OTHER_DEBT_GEN', 'ASSETS', 'LIABILITIES', \
      'RETAIL_SALES', 'ST_SALES_TAX_PAID', 'TOTAL_DEBT', 'POP']
cgg_real2=cgg_real.reindex(columns=cols)

#Calculate per capita set
cgg_rpc=cgg_real.div(cgg_real['POP'],axis=0)

Now we can take a look...

In [325]:
#Group by year
cgg_rpcg=cgg_rpc.groupby(level='AUDIT_YEAR')

#Plot median and average per capita revenues by year
plt.rcParams['figure.figsize']=20,10
#fig,axes=plt.subplots(1)
#Populate subplot #1
cgg_rpcg['REV_TOTAL'].mean().plot(kind='line',linewidth=3,title='Average Per Capita Revenues from All Counties in Colorado',label='AVG')
#Populate sublot #2
cgg_rpcg['REV_TOTAL'].median().plot(kind='line',linewidth=3,linestyle='--',title='Median Per Capita Revenues from All Counties in Colorado',label='MED')


"""
#Generate list of jurisdictions
county_list=[]
for county in cgg_rpc.index.get_level_values(level='NAME'):
    county_list.append(county)
print set(county_list)
"""
unsorted_counties=['Gilpin County', 'Las Animas County', 'Morgan County', 'Alamosa County', 'Dolores County', 'Summit County', 'Jefferson County', \
                   'Pueblo County', 'Jackson County', 'Clear Creek County', 'Bent County', 'Teller County', 'Rio Grande County', 'Montrose County', \
                   'Adams County', 'Moffat County', 'San Miguel County', 'Larimer County', 'Mesa County', 'Archuleta County', 'Sedgwick County', \
                   'Arapahoe County', 'Hinsdale County', 'Lincoln County', 'Denver, City And County of', 'Mineral County', 'Weld County', 'Huerfano County', \
                   'Montezuma County', 'Routt County', 'Garfield County', 'Custer County', 'Saguache County', 'Ouray County', 'La Plata County', \
                   'Otero County', 'Boulder County', 'Baca County', 'Conejos County', 'Lake County', 'Washington County', 'Eagle County', 'Douglas County', \
                   'Logan County', 'Costilla County', 'Kiowa County', 'Yuma County', 'Chaffee County', 'Delta County', 'Phillips County', 'El Paso County', \
                   'Fremont County', 'Rio Blanco County', 'San Juan County', 'Gunnison County', 'Elbert County', 'Broomfield, City and County of', \
                   'Cheyenne County', 'Park County', 'Kit Carson County', 'Prowers County', 'Pitkin County', 'Crowley County', 'Grand County']

#Plot average and median per capita revenues over time with individual counties as context
plt.rcParams['figure.figsize']=20,20
fig,axes=plt.subplots(2)
#Populate subplot #1
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['REV_TOTAL'].plot(kind='line',ax=axes[0],linewidth=1,label=county,alpha=.5)
cgg_rpcg['REV_TOTAL'].mean().plot(kind='line',ax=axes[0],linewidth=10,linestyle='--',title='Average vs Real Per Capita Revenues from All Counties in Colorado',label='AVG')
#Populate sublot #2
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['REV_TOTAL'].plot(kind='line',ax=axes[1],linewidth=1,label=county,alpha=.5)
cgg_rpcg['REV_TOTAL'].median().plot(kind='line',ax=axes[1],linewidth=10,linestyle='--',title='Median vs Real Per Capita Revenues from All Counties in Colorado',label='MED')

#Generate a box plot of REV_TOTAL
plt.rcParams['figure.figsize']=20,10
cgg_rpc.reset_index('AUDIT_YEAR').boxplot(column='REV_TOTAL',by='AUDIT_YEAR')
Out[325]:
<matplotlib.axes.AxesSubplot at 0x1b973390>

Well the distribution certainly looks a lot more reasonably consistent in per capita terms. In fact, it looks like median and average per capita revenues are very well correlated.

The second two plots demonstrate the substantial variation that occurs around measures of central tendency. The volatility is so great that per capita growth looks almost nonexistent!

What about the shape of the distribution?

In [322]:
for i in range(31):
    c_gg_cross=cgg_rpc.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Per Capita Revenue Shape Over Time',color=(((31-i)/float(31)),0.,(i/float(31))),label=str((1979+i)))
    plt.xlim([-2000,12000])
    plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)

And now for a look at expenditures...

In [326]:
#Plot median and average per capita expenditures by year
plt.rcParams['figure.figsize']=20,10
#fig,axes=plt.subplots(1)
#Populate subplot #1
cgg_rpcg['EXP_TOTAL'].mean().plot(kind='line',linewidth=3,title='Average Per Capita Expenditures from All Counties in Colorado',label='AVG')
#Populate sublot #2
cgg_rpcg['EXP_TOTAL'].median().plot(kind='line',linewidth=3,linestyle='--',title='Median Per Capita Expenditures from All Counties in Colorado',label='MED')


#Plot average and median per capita expenditures over time with individual counties as context
plt.rcParams['figure.figsize']=20,20
fig,axes=plt.subplots(2)
#Populate subplot #1
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['EXP_TOTAL'].plot(kind='line',ax=axes[0],linewidth=1,label=county,alpha=.5)
cgg_rpcg['EXP_TOTAL'].mean().plot(kind='line',ax=axes[0],linewidth=10,linestyle='--',title='Average vs Real Per Capita Spending from All Counties in Colorado',label='AVG')
#Populate sublot #2
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['EXP_TOTAL'].plot(kind='line',ax=axes[1],linewidth=1,label=county,alpha=.5)
cgg_rpcg['EXP_TOTAL'].median().plot(kind='line',ax=axes[1],linewidth=10,linestyle='--',title='Median vs Real Per Capita Spending from All Counties in Colorado',label='MED')

#Generate a box plot of REV_TOTAL
plt.rcParams['figure.figsize']=20,10
cgg_rpc.reset_index('AUDIT_YEAR').boxplot(column='EXP_TOTAL',by='AUDIT_YEAR')
Out[326]:
<matplotlib.axes.AxesSubplot at 0x2801b7d0>

To my mind, per capita expenditures look a bit more volatile than per capita revenues. Is that actually the case? Coefficient of variation might provide a reasonable view.

In [330]:
DataFrame(cgg_rpcg.std()/cgg_rpcg.mean())[['REV_TOTAL','EXP_TOTAL']].plot(kind='bar',title='CoV for Per Capita Revenues & Expenditures')
Out[330]:
<matplotlib.axes.AxesSubplot at 0x24540510>

And that's why you check these things! Evidently they are quite similarly volatile.

What about the expenditure distribution over time?

In [331]:
for i in range(31):
    c_gg_cross=cgg_rpc.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['EXP_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Per Capita Expenditure Shape Over Time',color=(((31-i)/float(31)),0.,(i/float(31))),label=str((1979+i)))
    plt.xlim([-2000,12000])
    plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)

Classification

Ultimately, my goal is to be able to characterize sub-state jurisdictions. I think the above view give enough of a wide angle view, but it doesn't add much value to repeat the compositional views unless I can draw distinctions across types of within-class jurisdictions. Currently, I am looking at counties, but any of the classification methods can be extended to local governments as well.

The first thing that comes to mind are scale effects. Before integration of TEL effects, it's important to understand whether or not simply having more people influences the efficiency of service delivery as measured by revenues or expenditure per capita. It would then be useful to see if growth trajectories look different according to scale. Exploratory analysis will utilize basic OLS, and then cursory Panel analysis.

To start, pooled OLS will suffice. For the sake of procedural familiarity, we will leave only time in the index. However, we must first reintegrate population with the per capita set (it was reduced to 1 when calculating per capita values).

In [355]:
#Reintegrate population
cgg_rpc['POP2']=c_gg.ix[cgg_rpc.index]['POPULATION'].values

#Limit available columns
ols_cols=[x for x in cgg_rpc.columns if x not in ['POPULATION','DFL','POP','SALES_TAX_RATE']]

#Isolate time in index
cgg_ols=cgg_rpc.reset_index(['NAME','LGTYPE_ID'])[ols_cols]

#Creat column slice lists
ols_y1='REV_TOTAL'
ols_y2='EXP_TOTAL'
ols_x=['POP2','RETAIL_SALES']

#Run OLS
rev_ols=pd.ols(y=cgg_ols[ols_y1],x=cgg_ols[ols_x])
exp_ols=pd.ols(y=cgg_ols[ols_y2],x=cgg_ols[ols_x])

print rev_ols
print exp_ols

#Capture impacts for display
ols_res=[rev_ols.beta[:2],exp_ols.beta[:2]]
ols_results=pd.concat(ols_res,axis=1)
ols_results.columns=['rev','exp']
print ols_results

#Plot results
plt.rcParams['figure.figsize']=10,5
ols_results.plot(kind='bar',title='OLS Coefficient (Population, Sales Volume Per Capita) Values') 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <POP2> + <RETAIL_SALES> + <intercept>

Number of Observations:         143360
Number of Degrees of Freedom:   3

R-squared:         0.0107
Adj R-squared:     0.0107

Rmse:            780.9272

F-stat (2, 143357):   776.0319, p-value:     0.0000

Degrees of Freedom: model 2, resid 143357

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          POP2     0.0001     0.0000       6.82     0.0000     0.0001     0.0001
  RETAIL_SALES     0.0056     0.0001      38.67     0.0000     0.0053     0.0059
     intercept  1246.9088     2.6552     469.61     0.0000  1241.7046  1252.1131
---------------------------------End of Summary---------------------------------


-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <POP2> + <RETAIL_SALES> + <intercept>

Number of Observations:         143360
Number of Degrees of Freedom:   3

R-squared:         0.0108
Adj R-squared:     0.0108

Rmse:            765.5481

F-stat (2, 143357):   783.8645, p-value:     0.0000

Degrees of Freedom: model 2, resid 143357

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          POP2     0.0001     0.0000       7.18     0.0000     0.0001     0.0002
  RETAIL_SALES     0.0055     0.0001      38.80     0.0000     0.0052     0.0058
     intercept  1227.4888     2.6029     471.58     0.0000  1222.3871  1232.5906
---------------------------------End of Summary---------------------------------

                   rev       exp
POP2          0.000116  0.000120
RETAIL_SALES  0.005605  0.005513

Out[355]:
<matplotlib.axes.AxesSubplot at 0x2a2a9910>

As can be seen, the explanatory power of scale (as measured by population and retail sales volume per capita) is absolutely abysmal. This bodes well, actually, because it leave more explanatory room for tax policy. Here's to keeping our fingers crossed!

In the meantime, let's see if a panel run (with jurisdiction in the index) makes a difference (still pooled OLS).

In [356]:
#Isolate time and jurisdiction in index
cggpc_panel=cgg_rpc.reset_index('LGTYPE_ID')[ols_cols]

#Run pooled OLS
rev_pool=pd.ols(y=cggpc_panel[ols_y1],x=cggpc_panel[ols_x])
exp_pool=pd.ols(y=cggpc_panel[ols_y2],x=cggpc_panel[ols_x])

print rev_pool
print exp_pool

#Capture impacts for display
pool_res=[rev_pool.beta[:2],exp_pool.beta[:2]]
pool_results=pd.concat(pool_res,axis=1)
pool_results.columns=['rev','exp']
print pool_results

#Plot results
plt.rcParams['figure.figsize']=10,5
pool_results.plot(kind='bar',title='Pooled OLS Coefficient (Population, Sales Volume Per Capita) Values') 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <POP2> + <RETAIL_SALES> + <intercept>

Number of Observations:         2240
Number of Degrees of Freedom:   3

R-squared:         0.1253
Adj R-squared:     0.1245

Rmse:            734.7796

F-stat (2, 2237):   160.2704, p-value:     0.0000

Degrees of Freedom: model 2, resid 2237

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          POP2    -0.0017     0.0001     -12.99     0.0000    -0.0019    -0.0014
  RETAIL_SALES     0.0137     0.0011      12.56     0.0000     0.0116     0.0158
     intercept  1273.3606    19.9864      63.71     0.0000  1234.1872  1312.5340
---------------------------------End of Summary---------------------------------


-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <POP2> + <RETAIL_SALES> + <intercept>

Number of Observations:         2240
Number of Degrees of Freedom:   3

R-squared:         0.1086
Adj R-squared:     0.1078

Rmse:            727.2224

F-stat (2, 2237):   136.2088, p-value:     0.0000

Degrees of Freedom: model 2, resid 2237

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
          POP2    -0.0015     0.0001     -11.74     0.0000    -0.0017    -0.0012
  RETAIL_SALES     0.0128     0.0011      11.81     0.0000     0.0106     0.0149
     intercept  1252.0585    19.7809      63.30     0.0000  1213.2879  1290.8290
---------------------------------End of Summary---------------------------------

                   rev       exp
POP2         -0.001662 -0.001487
RETAIL_SALES  0.013701  0.012757

Out[356]:
<matplotlib.axes.AxesSubplot at 0x27443cd0>

Hmmm, pooled OLS should have returned the same thing... right? Is it not just treating the set as a large cross-section? I need to explore this further. The increase in explanatory power seems to suggest that it was more an average of within group effects rather than an average effect over all observations...

I don't know, I have been staring at this screen almost constantly for about 11 hours, so maybe I am just having a giant brain fart...

In []: